Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret

Authors

  • Ofer Dekel
  • Ambuj Tewari
  • Raman Arora
Abstract

Online learning algorithms are designed to learn even when their input is generated by an adversary. The widely-accepted formal definition of an online algorithm’s ability to learn is the game-theoretic notion of regret. We argue that the standard definition of regret becomes inadequate if the adversary is allowed to adapt to the online algorithm’s actions. We define the alternative notion of policy regret, which attempts to provide a more meaningful way to measure an online algorithm’s performance against adaptive adversaries. Focusing on the online bandit setting, we show that no bandit algorithm can guarantee a sublinear policy regret against an adaptive adversary with unbounded memory. On the other hand, if the adversary’s memory is bounded, we present a general technique that converts any bandit algorithm with a sublinear regret bound into an algorithm with a sublinear policy regret bound. We extend this result to other variants of regret, such as switching regret, internal regret, and swap regret.
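The abstract does not spell out the conversion technique; a common way to realize such a regret-to-policy-regret reduction is blocking, and the Python sketch below is an illustrative rendering of that idea only (all names are hypothetical, and the toy ε-greedy learner merely stands in for any bandit algorithm with a sublinear regret bound, such as Exp3):

    import random

    class ToyBandit:
        """Stand-in for any bandit algorithm with a sublinear regret
        bound (e.g. Exp3); only the select/update interface matters."""

        def __init__(self, k, eps=0.1):
            self.k, self.eps = k, eps
            self.mean = [0.0] * k
            self.pulls = [0] * k

        def select_arm(self):
            if random.random() < self.eps:   # occasional exploration
                return random.randrange(self.k)
            return min(range(self.k), key=lambda a: self.mean[a])

        def update(self, arm, loss):         # running-average loss estimate
            self.pulls[arm] += 1
            self.mean[arm] += (loss - self.mean[arm]) / self.pulls[arm]

    def blocked_play(base, adversary_loss, T, tau, memory):
        """Blocking reduction (sketch): choose one arm per block of tau
        rounds, hold it fixed for the whole block, and feed only the
        block's average loss back to the base algorithm."""
        history, total = [], 0.0
        for _ in range(T // tau):
            arm = base.select_arm()
            block_loss = 0.0
            for _ in range(tau):
                history.append(arm)
                # a memory-bounded adversary reacts only to recent plays
                block_loss += adversary_loss(history[-(memory + 1):])
            total += block_loss
            base.update(arm, block_loss / tau)
        return total

    # Example: a memory-1 adversary that charges extra whenever the
    # learner's current arm differs from its previous play.
    switch_cost = lambda recent: 1.0 if len(set(recent)) > 1 else 0.5
    print(blocked_play(ToyBandit(k=3), switch_cost, T=10_000, tau=100, memory=1))

The point of holding the arm fixed is that once a block outlasts the adversary's bounded memory, the observed losses match those the arm would incur if it had always been played, which is exactly the counterfactual that policy regret measures.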


Similar Articles

Following the Perturbed Leader to Gamble at Multi-armed Bandits

Following the perturbed leader (FPL) is a powerful technique for solving online decision problems. Kalai and Vempala [1] recently rediscovered this algorithm. A traditional model for online decision problems is the multi-armed bandit: a gambler has to choose, at each round, one of the k levers to pull, with the intention of minimizing the cumulative cost. There are four versions of the nonstoch...
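As a brief illustration, a generic FPL selection rule can be sketched in a few lines of Python (an assumed, simplified variant with exponential perturbations, not necessarily the exact scheme of the cited paper):

    import random

    def fpl_choose(cum_loss, eta=1.0):
        """Follow the Perturbed Leader: subtract an independent
        exponential perturbation from each lever's cumulative loss
        and pull the lever minimizing the perturbed total."""
        perturbed = [c - random.expovariate(eta) for c in cum_loss]
        return min(range(len(cum_loss)), key=lambda a: perturbed[a])

    # Example: three levers with cumulative losses observed so far
    print(fpl_choose([4.2, 3.9, 5.0]))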


Online Learning with Switching Costs and Other Adaptive Adversaries

We study the power of different types of adaptive (nonoblivious) adversaries in the setting of prediction with expert advice, under both full-information and bandit feedback. We measure the player’s performance using a new notion of regret, also known as policy regret, which better captures the adversary’s adaptiveness to the player’s behavior. In a setting where losses are allowed to drift, we...


Online Geometric Optimization in the Bandit Setting Against an Adaptive Adversary

We give an algorithm for the bandit version of a very general online optimization problem considered by Kalai and Vempala [1], for the case of an adaptive adversary. In this problem we are given a bounded set S ⊆ R^n of feasible points. At each time step t, the online algorithm must select a point x_t ∈ S while simultaneously an adversary selects a cost vector c_t ∈ R^n. The algorithm then incurs cost c_t · x...
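For concreteness, one round of this protocol can be written as follows (an illustrative sketch; the selection rule and the adversary are hypothetical callables supplied by the caller):

    import numpy as np

    def bandit_linear_round(select_point, S, adversary_cost):
        """One round of bandit linear optimization: the learner commits
        to a point x_t in S, the adversary simultaneously picks a cost
        vector c_t in R^n, and only the scalar cost c_t . x_t -- not
        c_t itself -- is revealed to the learner (bandit feedback)."""
        x_t = select_point(S)            # learner's move
        c_t = adversary_cost()           # adversary's simultaneous move
        return float(np.dot(c_t, x_t))   # the only feedback observed

    # Example: standard basis vectors of R^3 as the feasible set
    rng = np.random.default_rng(0)
    S = [np.eye(3)[i] for i in range(3)]
    print(bandit_linear_round(lambda S: S[0], S, lambda: rng.uniform(-1, 1, 3)))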


Nonparametric Contextual Bandit Optimization via Random Approximation

We examine the stochastic contextual bandit problem in a novel continuous-action setting where the policy lies in a reproducing kernel Hilbert space (RKHS). This provides a framework to handle continuous policy and action spaces in a tractable manner while retaining polynomial regret bounds, in contrast with much prior work in the continuous setting. We extend an optimization perspective that h...


High-Probability Regret Bounds for Bandit Online Linear Optimization

We present a modification of the algorithm of Dani et al. [8] for the online linear optimization problem in the bandit setting, which with high probability has regret at most O*(√T) against an adaptive adversary. This improves on the previous algorithm [8] whose regret is bounded in expectation against an oblivious adversary. We obtain the same dependence on the dimension (n) as that exhibit...




Publication date: 2012